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Abstract 

We study the problem of incorporating covariates in a compound decision setup. 
It is desired to estimate the means of n response variables, which are independent and 
normally distributed, and each is accompanied by a vector of covariates. We suggest a 
method that involves non-parametric empirical Bayes techniques and may be viewed 
as a generalization of the celebrated Fay-Herriot (1979) method. 

Some optimality properties of our method are proved. We also compare it numer- 
ically with Fay-Herriot and other methods, using a 'semi-real' data set that involves 
spatio-temporal covariates, where the goal is to estimate certain proportions in many 
small areas (Statistical- Areas). 

1 Introduction 

The main purpose of this paper is to study and demonstrate how to incorporate compound 
decision techniques (CD), or almost equivalently, empirical Bayes (EB) methods, in the 
presence of explanatory variables. The ideas of CD/EB were developed in the 1950's by 
Robbins (1951, 1955, 1964), see the review papers by Copas (1969) and Zhang (2003). 
Compound decision (or Empirical Bayes) procedures, were shown to produce very efficient 
estimators in the simple setup where we have independent observations, Y 1: . . . , Y n , Y± ~ F w , 
and it is desired to estimate //j, % = 1, . . . , n. A major case, on which we will concentrate, is 
when F M . = N(^i, 1). 

We will focus on two types of EB procedures. One type is Parametric Empirical Bayes 
(PEB) procedure, where fii, i — 1, . . . ,n are assumed to be realizations of independent 
random variables Mi, i = l,...,n, Mj ~ G, G = iV(0,r 2 ), where r 2 is unknown and 
should be estimated from the data. When n is large, the corresponding estimator (note, 
the exact variant of the corresponding estimator, depends on the method of estimating r 2 ), 
resembles the James-Stein estimator, see e.g., Efron and Morris (1973). The other type is 
the Non-Parametric Empirical Bayes (NPEB) procedure, where the above distribution G is 
a member of a large non parametric family Q of distributions. Two recent NPEB methods 
and approaches are Brown and Greenshtein (2009) and Jiang and Zhang (2009). 
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The advantage of EB procedures, relative to more elementary (e.g., mle) procedures, 
occures as n grows, and may become very significant in high dimensional problems when n 
is large (e.g., n > 3 is needed for "Stein's paradox" to hold). A special advantage of NPEB 
procedures is expected in situations where the vector /x = (/ii, . . . ,/!„)' is sparse, see e.g., 
Greenshtein, Park and Ritov (2008), Brown and Greenshtein (2009). 

Since modern statistical problems often involve high dimensional and sparse estimation 
problems, EB techniques should be embraced for such purposes, see, e.g., Efron (2003). 
However, apart from literature in small area estimation, e.g., Rao (2003), which follows 
the seminal paper of Fay and Herriot (1979), EB is hardly used in modern data analysis. 
One reason is that in most applied problems, we have explanatory variables Xn, . . . , X ip for 
each observation Y and in such cases EB has no appeal, since simple symmetric decision 
procedures have no appeal. We elaborate in the following. 

In our motivating example Y ~ B(mi,pi), the binomial distribution, and we need to 
estimate pi,...,p n , certain proportions, in n (small) areas. The values of pi,...,p n are 
unknown constants to be estimated. In addition to the sample Y±, . . . ,Y n , we have a set 
of variables Xi, . . . , X n (fixed or random, but independent of Yi, . . . , Y n ) and hopefully 
Xi can serve as proxies to Pi, i = 1, . . . ,n. For example one dimensional covariates X^ ~ 
B(ki,pi) where pi are "typically" close to pf, alternatively Xi may be a vector of known 
parameters of area i, that might be "relevant" to the parameter of interest pi, for example 
the socio-economic level of the region, its size, or mean age. We emphasize two elements. 
First, because of the proxies, Yi, . . . , Y n cannot be considered as "permutation invariants" or 
"exchangeable" . Second, we do not believe that the observations follow standard regression 
models. The covariates are considered as proxies to pi, but they are statistically independent 
of the Y's (whose only stochastic aspect comes from the binomial sampling), and may be 
only a rough approximation to p\, . . . ,p n . 

Simple symmetric and permutation invariant procedures. In cases of total ignorance 
regarding the parameters of the variables in relation to their identity, e.g., a situation where 
Yi ~ N(fii, 1) and there is an exchangeable multivariate prior on (/i 1; . . . ,/x n ), procedures 
which are permutation invariant have a special appeal. Permutation invariant procedures A 
are such that for every permutation n, 

A(Yi, ...,Y n ) = (ai, . . . , a n ) -<=>- A(Y 7r (i), . . . , Y^) = (a^i), ■ a^n)); 

here G A, where A is the ( abstract) action space. A simple class of exchangeable priors 
is where /ij are realizations of i.i.d Mi ~ G, i — l,...,n. The optimal procedures then 
belong to the class of 'simple symmetric decision functions', i.e., procedures A which are of 
the form: 

A(Y 1 ,...,Y n ) = (5(Y 1 ),...,5(Y„)), 

for a given S. For natural losses, given G, the optimal 8 corresponds to the one dimensional 
Bayes procedure. On the relation and asymptotic equivalence between the above two classes, 
see Greenshtein and Ritov (2009). Given a loss function, consider an 'oracle' who knows the 
values of fj,i, . . . but is required to use a permutation invariant procedure. EB and CD 
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procedures may be viewed as an attempt to immitate the (unknown) procedure that an 
oracle would use. This is a very natural goal under 'total ignorance' or 'exchangeability'. 

The appeal in using permutation invariant procedures and consequently EB procedures, 
is lost when exchangeability is lost, as in cases where there are explanatory variables. Assume 
n = ni+n 2 and it is known that the first n\ observations (say, hormone measurements), were 
taken from men, while the last n 2 were taken from women. Applying a permutation invari- 
ant procedure is equivalent to ignoring this potentially important information/explanatory- 
variable. However not all is lost, one may still apply EB procedure seperately on the first 
n\ observations and on the last n 2 observations. The idea is that after accounting for the 
explanatory variable in this trivial manner, we arrive into (two groups of) exchangeable 
variables and applying EB procedures seperately on each group becomes appealing. In a 
similar manner, we will account for the information in the explanatory variables and then, 
after the information from the explanatory variables is accounted for and the "accounted 
observations" are closer to being exchangeable, we apply an EB procedure. 

EB and CD are closely related notions and approaches. Under an EB formulation the 
parameters fa, % = 1, . . . ,n are independent realizations from an unknown distribution G 
and the aim is to approximate the corresponding Bayes rule; under a CD formulation the aim 
is to approximate the best decision rule within a class of procedures (e.g., simple-symmetric, 
permutation invariant), for the given /i, = (fa, . . . , /i n )' . In this paper we emphasize the 
CD approach. However, we will often use the more familiar EB notion, motivation and 
terminology. 

Applying a (variant of) PEB method after accounting for the covariates, is in the spirit of 
the paper of Fay and Herriot, as shown in sub-section 2.2; it is currently the most common 
practice. Another approach for inference in the presence of explanatory variables is that 
of Lindley and Smith (1972), it is a parametric empirical Bayes approach, though different 
than that of Fay and Herriot. 

In Section 2, we will suggest how EB could naturally be incorporated in problems with 
explanatory variables. We extend the Fay-Herriot approach and present its PEB and NPEB 
versions. We show asymptotic optimality of NPEB. In section 3, we demonstrate the appli- 
cation of our suggested methods on a "semi-real" data, which is based on the recent Israeli 
census. The application involves estimation of certain population's proportions in small 
areas (Statistical Areas). The explanatory variables available when estimating the propor- 
tion pi in statistical dbTQdb % . £1X6 'Spatial' and 'Temporal', based on historical data, and data 
from neighbors. We elaborate on comparing PEB procedures, versus the more recent NPEB 
procedure, suggested by Brown and Greenshtein (2009). 

Our ideas and techniques are meaningful in a genreral setup where Y{ ~ F jUi , but will be 
presented for the case = N(fa, 1), % — 1, . . . , n. In fact, as mentioned we will apply our 
method for estimating proportions in the setup where Yj ~ B(rrii,Pi), but applying an arcsin 
transformation will bring us to the normal setup. 
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2 Collections of estimators induced by affine transfor- 
mations 



The setup we consider is where we observe vectors Vj = (Yi, Xn, . . . , Xi P ), % — 1, . . . , n, where 
Yi ~ N(fii, 1) are independent response variables, and X^ are explanatory variables indepen- 
dent of Yi, i = 1, ... ,n, j = 1, ... ,p. Denote by X nxp the matrix of the explanatory variables. 
Denote, Y' = (Y 1: . . . , Y n ). The goal is to find a 'good' estimator fi = £i(Vi, . . . , V n ), under 
the risk 

E\\jj, - n\\ 2 2 . 

In a nutshell the motivation and approach are as follows. Ideally it could be desired to 
approximate the Bayes procedure, assuming (at leat tactically) that (Vj,//j), % = 1, . . . ,n, 
are independent random vectors sampled from an unknown distribution T that belongs to a 
large non-parametric family of distributions Q. Then, the goal is to approximate the Bayes 
decision 5* = argmin^ E r \ \S(Vi) — fii\\ 2 by 8*, and let fi = . . . ,5*(V n )). However, 

this goal may be too ambitious for p + 1 dimensional observations V \ when n is moderate 
due to the "curse of dimensionality". A possible approach, in the spirit of Lindley and 
Smith (1972), is to assume that Y belongs to a convenient parametric family, and this way 
the "curse of dimensionality" and other difficulties are circumvented. The approach of Fay 
and Herriot (1979) may also be interpreted this way. We, on the other hand, aim for the best 
permutational invariant estimator with respect to Zi, . . . , Z n) where Zi are one dimensional 
random variables which are obtained by a suitable transformation of (Vi, . . . , V n ). This 
transformation is estimated from the data. 

2.1 preliminaries and definitions 

We start from a general point of view, where initially there are no covariates. We observe 
independent Yi ~ N(fii, 1), % = 1, . . . ,n. Let {T} be a collection of affine transformations 
T(Y) = Ta,b(Y) = AY — B, where A is an orthonormal matrix and B is a vector. Then 
Z = T{Y) is distributed as a multivariate normal with mean vector denoted v, v = Afj, — B, 
and covariance matrix the identity. Let A = A(Y) be a fixed estimator of the vector /x, 
which is not invariant under the group of affine transformations, i.e., A(T(Y)) ^ T(A(Y)). 
Then, the pair A and {T} defines a (non-trivial) class of decision functions {A r }, T e {T}, 

A T (Y)=T-\A(T(Y)). 

Let 

T opt = argmin£ M ||A T (r) - fj,\\ 2 2 = argmin R(T, /n); 

Tg{T} ~ Te{T} 

here R(T, /j,) is implicitly defined. 

Our goal is to approximate T opt , and then estimate \i by an approximation of A^ ov t{Y\ 
For every T G {T}, suppose we have a good estimator R(T, /x) for R(T,n). Let T = 

argmin Tg | T | R(T, fi). The usual approach, which we will follow, is to use the estimator 
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fi = Af(Y). When the class {T} is not too large, we expect only a minor affect of over- 
fitting, i.e., R(f, /Li) « R(T opt , y). 

Example 1 (Wavelet transfrom) The above formulation describes many standard tech- 
niques. In fact any harmonic analysis of the data that starts with transforming the data by a 
standard transformation (e.g., Fourier transform) follows this outline. A special case is when 
T(Y) = AY , where A is the matrix which transforms Y to a certain wavelet representation, 
then, typically, the mean of the transformed vector is estimated and transformed back, see 
Donoho and Johnstone (1994). Suppose that, {T} = {A} is a collection of matrices that 
correspond to a collection of wavelet bases/ "dictionaries". The problem of finding the most 
appropriate basis/transformation, is related to that of basis-pursuit, see e.g., Chen, et.al. 
(2001). The permutational invariant and non-linear decision functions A in those studies 
is s oft /hard-thres olds , Lasso, etc. As mentioned, procedures of a special interest for us are 
parametric and non-parametric EB. 

Example 2 (Regression) Suppose that in addition to Y there is a fixed (deterministic!) 
matrix X G R nxp . Consider the class of transformations T(Y) = Y — B, B G {B}, where 
{B} is the collection of all vectors of the form B = X(3, (3 G R p . Note, in particular, that 
our transformations are non-random. 

Remark 1 The formulation for a random set {T}, which is independent of Y is just the 
same. In the last example when X nxp is random, we condition on the explanatory variables 
and arrive to a conditional inference version of the developement in the sequel. From a 
Bayesian perspective, assuming a joint distribution T as above, conditional independence of 
the random set {T} and Y , conditional on the covariates, follows when we assume that Y 
and X nxp are independent conditional on y. We will remark later on the case where the 
random set of transformations is 'weakly dependent' on Y . 

The following fact is useful. Let Z = Z(T) = At(Y). Then Zj ~ N(vi,l) where 
v = u(T) = Ay(/it), and 

R(T,y) = E^WAriY) - y\\ 2 2 = E U \\A(Z) - v\\\ = R(I,u). (1) 

In the last equality / represents the identity transformation. When there is no real danger 
of confusion, the dependence on T is suppressed. We will use equation ([IJ later to establish 
an estimator R(T, fj,) for R(T, fjt). 

The following general three steps method, for estimating /j,, suggests itself. 

Step I: For every T, estimate R(T, /x) by R(T, fj,). 

Step II: Find T = argmin T R(T, y). 

Step III: Get the estimator: ft = f~ 1 (A(f(Y))) = A f (Y). 
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We summarize. The idea in this subsection is that by an appropriate affine transfor- 
mation, that may depend on explanatory variables, we will arrive to a problem which is 
'easier' for the procedure A to handle. For example, by choosing an appropriate wavelet 
basis we will arrive to a sparse u, which, roughly, is easier to handle/estimate the sparser it 
is. More generally, in a rough sense, good permutaion invariant procedures A, "prefer" to 
estimate sparse vectors u, hence transforming the original problem to a sparse problem is 
useful. Indeed accounting for the explanatory variables in a 'good' way, often brings us to 
a correponding sparse v. Moreover, by accounting for explanatory variables in a good way 
through a suitable transformation, we may obtain (nearly) exchangeable variables; whence, 
applying a permutation invariant procedure A on the transformed variables becomes natural 
and appealing. 

2.2 The case where A is parametric empirical Bayes and the Fay- 
Herriot procedure. 

In this subsection we study the case where A is a parametric empirical Bayes that corresponds 
to the prior iV(0,r 2 ), where r 2 is unknown. When r 2 is known the corresponding Bayes 
estimator for is fa = -^r{Yi, and its risk is ^fri- When r 2 is unknown, we replace r 2 
by its estimate. For our level of asymptotics all consistent estimators f 2 induce equivalent 

* 2 

estimators fii = ^r{Yi, and the corresponding estimators are asymptotically equivalent to 
James-Stein estimator up to o(n), see Efron and Morris (1973). By working in this level of 
asymptotic, the considerations in this subsection are valid for a wide class of PEB procedures, 
corresponding to various consistent methods of estimating r 2 , including the J-S procedure. 
In particular, the risk in estimating a (deterministic) vector fi by PEB (or James-stein's) 
method equals: 

77.1 |/xl || , . 

,, ;" 2 +o(n). 

We now examine our three steps estimation scheme, adapted for parametric Empirical 
Bayes (or, for a James-Stein estimator A). Note that, for every T and the corresponding v 

II 1 1 2 

and Zi we have: R(I, v) = ™„]p+ n + o(n). Hence a plausible estimator for R(T, /x) is 

R(T, M ) = R(I, u) = max{0, "^"^ } = maxjo, ) ( 2 ) 

1 (EZf - n)+n) I }2 Z f J 

Our three steps scheme adapted for parametric empirical Bayes A is the following. 
Step I: For every T estimate R(T, ft) by 
Step II: Find T = argmin T R(T, /j,). 

Step III: Get the estimator: (i = f^ 1 (A(f(Y))) = A f (Y). 

Remark 2 In the case where {T} corresponds to {B = X(3 : (3 e R p }, the optimization 
step II is trivial. We want to minimize the residuals Zf. This is achieved for B which is 
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the projection ofY on the span of the columns of X, i.e., for T(Y) = Y — X(3, where /3 is 
the ordinary least squares estimator. Upon realizing the last fact, it is easy to see that our 
above suggested method is the method of Fay and Herriot. 



2.3 A nonparametric empirical Bayes A 

The statements and development in this sub-section are for nonparametric empirical Bayes 
procedure A, as in Brown Greenshtein (2009), see appendix. 

Let Zi ~ N(ui, 1) be independent. Denote by lZ(u), the Bayes risk that corresponds to 
the prior which is defined by the empirical distribution of v. Let f u — - ^ 4>{z — z/j), where 
is the density of a standard normal distribution. Then 

see Bickel and Collins (1983). 

The following theorem follows from Brown and Greenshtein (2009). It is stated for a 
triangular array set up, in order to cover situations of sparse u = v n . At stage n, Yi ~ 
N(n2, 1) are independent and for any corresponding sequence T n , T n e {T n }, Zi ~ N(v™, 1) 
are independent, i — 1, . . . , n. 

Assumption 1 For every a > and every sequence T n and the corresponding v n we have 
maxj(z/™) — mini ( i/") = o(n a ). 

Assumption 2 For some ao > 0, n^'^lZ^ 11 ) — > oo for every T n and corresponding v n . 

Theorem 1 Under Assumptions 1 and 2, for every sequence T n , 

R(I,u n ) = £U||A(Z) - u n \\l = (1 + o(l))nll(v n ) (4) 

As explained in the appendix, the procedure A in Brown and Greenshtein requires a band- 
width h = h n , which approaches slowly to zero. The rate that implies the result in Theorem 
[]]is h n ^J\og{n) — > oo. 

Given Yi ~ N([Ai, 1), and a transformation T, T 6 {T}. Let Zi be the i'th coordinate 
of Z = T{Y). The last theorem, and equations ([T]) and suggest the following estimator 
R(T,fj,) for R(T,n), 

R{T^) = n-Y J [^f^}\ (5) 

iu\Zi) 

where the density /„ and its derivative are estimated, for example, by appropriate kernel 
estimates. 

Only step I of our general three steps procedure should be adapted, and replaced by: 
Step I: For every T and corresponding v = u(T), estimate R(T, /j,) by (J5j). 
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Remark 3 Step II could be computationally very complicated when the set {T} is large. 
In the case where {T} corresponds to {B = Xfl : (3 G R p }, a plausible choice, which 
is computationally convenient is to use the least-squares residuals for T(Y), as in the PEB 
case. This choice could be far from optimal as will be demonstrated in the following Examples 
[3] and [7] and in the simulations section. 

Note, minimizing R(I, v) with respect to u — u(T) is equivalent to finding the "most 
favorable" prior, rather than the more conventional task of finding the least favorable prior. 

The above method is reasonable when the class {T} is not too large (in a VC dimension 
sense) and the overfit affect is not significant, otherwise regularization may be required. Those 
considerations are beyond the scope of our paper. 

Choosing the least squares residuals, as mentioned in the remarks above, may be very 
inefficient, since it might cause "smoothing" of the empirical distribution and low values of 
(/~) 2 , which by Q implies high risk. This could be caused, e.g., by transforming a sparse 
structure into a non-sparse one, as in the following Example [3j or by transforming a structure 
with well separated groups into a mixed structure, as in the Example HI 

Example 3 7,™ N(l,l), i = l,...,2m, 2m = n. Suppose we have only one (useless) 
explanatory variable Xi = 1 if % < m and otherwise. Projecting Y on X , we get 
B pa (1, . . . , 1, 0, . . . , 0)' and u = fi — B pa (0, . . . , 0, —1, . . . , —1)', which is much worst 
for empirical Bayes estimation than the original [i: It is easy to see that nTZ(0) = 0(n), 
while nlZ(fj.) = 0. From Theorem^ we conclude that as n — )■ oo the advantage of the latter 
(trivial) transformation compared to the least squares residuals in terms of the risk is o(n) 
compared to 0(n). 

Example 4 Let Y{ ~ N(fa, 1) are independent, where fa = fa for i = 1, . . . , m and fa = 
—fa for i = m + l,...,2m = n. Suppose Xi = (fa + Wj) ~ N(fa,l), independent of 
Yi, i = 1, ... ,n. Let v = fi — B where B is the projection of Y on the (random) vector 
X = (Xi, . . . , X n )' . It easy to check that Vi — > fa/ (fa 2 + 1)— //|Wi/(//i + 1) as n — )■ oo. When 
fa — > oo, the empirical distribution of v = v n converges to that of a standard normal. The 
corresponding Bayes risk IZ(u n ) converges to 0.5. Obviously the Bayes risk that corresponds 
to the trivial transformation, for which v n = /j, n , converges to zero. 

2.4 Optimality of NPEB A. 

Until this point the treatment was for a concrete procedure A and a class {T} of transfor- 
mations. The purpose of this section is to advocate the choice of a non-parametric empirical 
Bayes A, which is denoted A NP . 

However, as noted, the optimization step (Step II) in the non-parametric approach may 
be computationally intensive, so such dominance result might not be enough to persuade 
that the non-parametric approach might be a good alternative to the parametric approach 
and to the Fay Herriot procedure. In Theorem [2] below we show that for every two sequences 
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fi n and T n , the sequence of estimators, that is obtained by coupling T n with A NP , asymp- 
totically dominates the sequence which is obtained when coupling the same T n with any 
other sequence of permutation invariant procedures A n . 

Given a procedure A, a transformation T, and a mean vector /j,, the corresponding risk 
is denoted for simplicity as Ra(T,/j,) = R(T, /x) as before; for the case of nonparametric 
EB procedure A NP , the corresponding risk is denoted R NP (T, /x). Similarly to the previous 
sub-section our asymptotic analysis is of a triangular array setup. 

Theorem 2 Letfi n , A n andT n be arbitrary sequences. Assume that for each n the procedure 
A™ is simple symmetric. Further assume Assumptions 1,2. Then: 

hm sup — f < 1 . 

Proof: Follows from Brown and Greenshtein (2009) and Theorem 1. Note that, the risk of 
the optimal simple symmetric procedure equals nR{y n ). 

Conjecture: It seems that in Theorem 2, the condition that A™ are simple symmetric 
for every n, may be replaced by the weaker condition, that A™ are permutation invariant 
for every n. This should follow by an equivalence result in the spirit of Greenshtein and 
Ritov (2009), though stronger. Note, the equivalence result in Greenshtein and Ritov (2009) 
would suffice under the assumption that maxj(z/™) — minj(z/™) = 0(1); however, Assumption 
1 allows a higher order. 



2.5 Remark 

The following remark is for the case in which we are mainly interested, where {T} corresponds 
to {B = Xj3}. Denote B = (Bi, . . . , B n )' . In the application we have in mind the set {T} 
may be random since Xij could be random. When the random set of transformations is 
independent of Y, our above treatment applies by conditioning on the explanatory variables. 
We will be interested in situations where the random set {T} may depend on Y, however 
we will require that is independent of Xn,...,X ip for each i. Then the conditional 
distribution of Zi conditional on Xn, . . . , X ip is N(ui, 1), where (vi, . . . , u n )' — u — Afi — B 
as before. When the dependence of on Xji, . . . ,Xj p , j ^ i is not too heavy, a natural 
goal is still to try to approximate the best decision function for estimating Vi among the 
decision functions which are simple symmetric with respect to Zi, . . . , Z n . The conditional 
marginal distribution of Zj, i — 1, . . . , n is still Nfa, 1) (i.e., the conditional distribution of 
Zi = Yi—Bi conditional upon (X a , . . . , X ip )); however, we may not treat them as independent 
observations. Thus, the rates of estimating f v and its derivative may become slower, and 
for heavy dependence, Theorems 1 and 2 might not hold. Similarly, rates of estimation of 
t%, in order to apply the PEB procedure, could be slow. However, when the dependence is 
not "too heavy" we may expect Theorems 1 and 2 to hold under the assumption that Yj, is 
independent of X h , . . . , X ip for each i. 
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3 Simulation 



3.1 Preliminaries 

The city of Tel Aviv is divided into 161 small areas called "statistical areas", each area 
belongs to a sub-quarter that includes about four additional statistical areas. In the recent 
Israeli census the following the proportion p^ of people who are registered in area i among 
those who live in area i, , % = 1, . . . 161, were of interest as part of the process of estimating 
the population in each statistical area. The estimated pi, i — 1, . . . ,n are used in order to 
adjust the administrative- registration counts and get population estimates for each area. We 
will not elaborate on it. In our simulation we use as p±, . . . ,p n their value as estimated in 
the recent census (where about 20% of the population was sampled). The mean of p iy i = 
1, . . . , 161, is 0.75 and their standard deviation is 0.13, their histogram is roughly bell shaped. 

We will present a simulation study in which p iy i = 1, . . . , 161 are estimated based on 
samples of size rrii and corresponding simulated independent Y/, Y- ~ B(mi,pi). Here Y( is 
the number of people in the sample from area i, which are registered to area i. 

In addition we will simulate covariates. We will simulate temporal variables that corre- 
spond to historical data from each area i, and spatial covariates, that correspond to samples 
from the neighboring areas of each area i. In the following we will explore simulations and 
scenarios for the cases of: only temporal covariates, only spatial covariates, and both tem- 
poral and spatial covariates. We will compare the performance of PEB, NPEB and other 
methods. In all the analyzed situations, we will simulate binomial observations with sample 
size rrii = m, for m = 25, 50, 100. 

In order to reduce this setup to the above normal case, we apply an arcsin transformation 
on our binomial observations Y h i — 1, . . . , n, as in Brown (2008). Specifically, let 



r- — / Yt + 0.25 \ . , 

F^v^arcsm^-^j. (6) 

Then, Yi are distributed approximately as iV ( y/Am arcsin ( v /p^) , 1). We estimate /ij = E(Yi), 
by jii, i — 1, . . . , n, as explained in sub-sections 2.3 and 2.3, and then let the estimate of pi, 
i = 1, . . . , 161 equal, 

= (sm(-§M) 2 . (7) 

Let p = (pi, . . . ,p n ) and p = (pi, . . . ,£>„)). We evaluate the performance of an estimator 
according to the risk 

E p\\p-P\\l- 

The risk is approximated through 1000 simulations for each entry in the tables in the sequel. 
A different parametric EB approach for estimating proportions in small areas, that involves 
a logistic regression model, may be found in Farell, et.al. 
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3.2 Temporal Covariates 

We introduce now simulated scenarios with only Temporal covariates. We think of a process 
where each year a sample of size m is taken from each area. Suppose we use the records 
of the previous three years as covariates. Let Tj be the number of people among the 3m 
that were sampled in the previous three years from area %, which were registered to the 
area. Although Tj might be better modeled as a binomial mixture, we will model % as 
B(3m,p it ) for simplicity. In order to (hopefully) have a linear relation between the response 
and explanatory variable, we define our temporal covariates as: 

Note, if there is hardly any change from the previous three years to the current year in area 
i, we will have pi ps p it and E(Ti) ps E(Yi). 

In the following we will simulate two scenarios. One scenario is of no-change where 
Pit = Pi, i = 1, . . . , 161. The other scenario is of a few abrupt changes; specifically, pi = 
Pu, i = 17, ... , 161, however p it = 0.3 < pi for i — 1, . . . , 16. Such abrupt changes could 
occur in areas that went in previous years through a lot of building, internal immigration 
and other changes. 

Since the empirical distribution of E(Yi) is roughly bell-shaped it is expected that the 
PEB method will work well in the no-change scenario. Under the few abrupt changes, an 
advantage of the NPEB procedure will be observed. 

As mentioned in Section 2, the optimization step of the NPEB procedure is difficult. 
We will try two candidate transformations Y — B % , i = 1,2, coupled with the NPEB, 
the corresponding methods are denoted NPEB1 and NPEB2. NPEB1 corresponds to the 
least-squares/Fay-Herriot transformation, while NPEB2 corresponds to the transformation 
Zi = Yi — Ti (i.e., B 2 = (T 1 ,...,T n )' ). The later transformation, although still sub- 
optimal when coupled with a NPEB A, could occasionally perform better than the former, 
as also indicated by Examples 3 and 4. In addition to comparing the risks of the PEB and 
NPEBi, i — 1,2 methods, we will also compare the the risk of the naive estimator, and of 
the regression estimator. The regression estimator estimates fii through the least squares 
predictor (i.e., ft = X(3), however it does not apply an additional PEB or NPEB stage. The 
Naive estimator simply estimates Pi by the corresponding sample proportion. 

The no-change scenario is presented in Table 1. Each entry is based on 1000 simulated 
realizations. Under no-change the temporal covariate is very helpful, and even the regression- 
estimate, i.e. least squares linear predictor is doing very well. Over all, the Naive estimator is 
the worst, NPEBI, NPEB2 and Regression are about the same, while the PEB is moderately 
better than the other methods. 

Next we consider the scenario of a few abrupt changes. In this scenario the regression by 
itself is performing the worst, however an additional EB step is helpful. Here the NPEB2 
procedure is the best, see Table 2. 
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Table 1: 





Naive Reg NPEB1 NPEB2 PEB 


m = 25 
m = 50 
m = 100 


1.12 0.33 0.35 0.37 0.27 
0.56 0.17 0.18 0.18 0.14 
0.28 0.092 0.093 0.093 0.073 



Table 2: 





Naive Reg NPEB1 NPEB2 PEB 


m = 25 
m = 50 

771 = 100 


1.12 1.66 0.75 0.49 0.68 
0.56 1.64 0.46 0.22 0.42 
0.28 1.62 0.26 0.11 0.24 
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Table 3: 





Naive Reg NPEB1 NPEB2 PEB 


m = 25 
m = 50 
m = 100 


1.12 1.41 0.72 0.75 0.64 
0.56 1.34 0.44 0.44 0.40 
0.28 1.31 0.26 0.28 0.23 



3.3 Spatial Covariates 

In this section we simulate a scenario with spatial covariates. Tel- Aviv is divided into sub- 
quarters, where a few statistical areas define a sub-quarter. Each sub-quarter is defined by 
about 5 statistical areas. For every i — 1, . . . , 161, we define the neighborhood of area i", as 
all the statistical areas other than area i, that belong to the same sub-quarter as area i. 

Based on the census we have good estimates for p is - the proportion of people living in 
the neighborhood of area i, who are registered to their areas. Those estimates are treated 
as the "real" values in our simulations. The correlation between p^ and p is , i = 1, . . . , 161 is 
0.62. 

For simplicity we will assume that for each i, the size of the sample from the neighborhood 
of area i is 4m. Let Si be the number of people sampled from the neighborhood of i, who 
are registered to their area. Although Si might be better modeled as a binomial mixture, 
we will model Si as Si ~ B(4m,pi S ) for simplicity. As in the case of Temporal covariates we 
define the Spatial covariate for area i as: 

Si = V^marcsinf \\ A l + °' 25 V (9) 
Vy 4m + 0.5 / v ; 

As in the temporal case we will consider two NPEB estimates, corresponding to the project ion/Fay- 
Herriot and to the Zi — Yi — Si transformations. The results of our simulations are summa- 
rized in Table 3. The advantage of the EB procedures is more noticeable for small m = 25. 
The explanation is the following. Since the temporal covariate is not very strong, v-the 
mean of the transformed variables is not too sparse. When m is large, under the scale which 
is induced by the variance of Zi, the points i/j, % = l,...,n, may be viewed as isolated 
(i.e., extremely non sparse) and the smoothing of the EB is hardly effective. Hence the EB 
methods behave roughly like the Naive estimator. 

One could wonder whether the spatial covariates are helpful at all, for the non parametric 
empirical Bayes, i.e, may be it is better not to transform the data at all and to apply A^p 
on the original data taking T = I and v = fi. However this option is slightly worst than the 
above ones. The simulated risks that correspond to m = 25, 50, 100 are 0.84 , 0.5 and 0.28. 
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Table 4: 



Naive Reg NPEB1 NPEB2 NPEB3 NPEB4 PEB 



m = 25 1.12 1.13 0.65 
m = 50 0.56 1.06 0.4 
m = 100 0.28 1.03 0.24 



0.49 0.54 0.55 0.58 
0.22 0.28 0.38 0.37 
0.11 0.15 0.22 0.22 



3.4 Spatial and Temporal Covariates. 

In this sub-section we study the performances of our estimators when both the temporal 
and spatial variables are introduced. As before we will apply the projection transformation 
for the NPEB estimator. However, we will also try the transformations Zj = Yj — (aSi + 
(1 — a)Ti), for a = 0,0.3,0.6. The corresponding estimators are denoted: NPEB1 (for the 
projection), NPEB2, NPEB3 and NPEB4 correspondingly. For the temporal covariates we 
simulate the scenario of 16 abrupt changes, the spatial covariates are as before. As may 
be expected, since the spatial covariate is weak relative to the temporal, accounting for it 
causes extra unnecessary smoothing. For the non-parametric EB procedure, indeed NPEB2 
that corresponds to a = has the best performance, which is also the optimal among all the 
seven methods. 



4 Appendix 

NPEB procedure. We will follow the approach of Brown and Greenshtein (2009), see that 
paper for further details. 

Assume Z % ~ N(ui, a 2 ), i = 1, . . . , n, where ~ G. 

Let 

f(z) = [ ^( Z —)dG(v). 
J a a 

It may be shown that the normal Bayes procedure denoted 8%, satisfies: 

%(z)=z + o*fj&. (10) 

The procedure studied in Brown and Greenshtein (2009), involves an estimation of 8^, by 
replacing / and /' in (|T0|) by their kernel estimators which are derived through a normal 
kernel with bandwidth h. Denote the kernel estimates by fh and f' h we obtain the decision 
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function, (Zi, . . . , Z n ) x z >->• R: 

5 N , h (z) = z + <x 2 (11) 
fh(z) 

A suitable (straightforward) truncation is applied when estimating the corresponding 
mean of points Zi for which f{ZA is too close to zero and consequently \b~N,h{Zi) — Z{\ > 
21og(n). We did not apply such truncation in our simulations in this paper. The default 
choice for the bandwidth h = h n , suggested by Brown and Greenshtein is 1/ ^/log(n). See 
also, a cross-validation method for choosing h, suggested by Brown, et.al., (2010), together 
with some suggested improvements of the above procedure. In our numerical study, n — 161 
and we chose h = 0.4. The procedure is not too sensitive to the choice of h. 
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